In [25]:
import graphlab as gl
gl.canvas.set_target("ipynb")

In [26]:
implicit = gl.SFrame('implicit')
explicit = gl.SFrame('explicit')
items = gl.SFrame('items')
ratings = gl.SFrame('ratings')

In [5]:
ratings.show()


Split the data into a training set and a validation set

This allows us to evaluate each model's ability to generalize to held-out data.


In [27]:
train, valid = gl.recommender.util.random_split_by_user(implicit)
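Conceptually, a per-user split samples some users and holds out a fraction of each sampled user's interactions for validation. A minimal plain-Python sketch of the idea (illustrative only, not GraphLab's implementation; the function and parameter names here are made up):

```python
import random

def split_by_user(observations, max_num_users=2, item_test_proportion=0.2, seed=0):
    # Illustrative sketch: sample some users, then hold out a random
    # fraction of each sampled user's items for validation.
    rng = random.Random(seed)
    users = sorted({u for u, _ in observations})
    held_out_users = set(rng.sample(users, min(max_num_users, len(users))))
    train, valid = [], []
    for user, item in observations:
        if user in held_out_users and rng.random() < item_test_proportion:
            valid.append((user, item))
        else:
            train.append((user, item))
    return train, valid

obs = [('u1', 'a'), ('u1', 'b'), ('u2', 'a'), ('u3', 'c'), ('u3', 'd')]
tr, va = split_by_user(obs)
```

Splitting by user (rather than uniformly at random) keeps every validation user present in training, so models can be evaluated on users they have actually seen.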

Feature engineering

Compute the number of times each item has been rated.


In [28]:
num_ratings_per_item = train.groupby('item_id', {'num_users': gl.aggregate.COUNT})
items = items.join(num_ratings_per_item, on='item_id')

Transform the count into a categorical variable using the feature_engineering module.


In [29]:
binner = gl.feature_engineering.FeatureBinner(features=['num_users'], strategy='logarithmic', num_bins=5)
items = binner.fit_transform(items)
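Roughly, a logarithmic binning strategy buckets each count by order of magnitude, so wildly popular items don't dominate the feature. A plain-Python sketch of the idea (the exact bin edges FeatureBinner uses may differ):

```python
import math

def log_bin(count, num_bins=5, base=10):
    # Bucket a positive count by order of magnitude:
    # bin 0 covers [1, 10), bin 1 covers [10, 100), and so on,
    # with anything beyond the last edge clamped into the top bin.
    return min(int(math.log(count, base)), num_bins - 1)
```

For example, an item rated 7 times lands in bin 0, while one rated 250 times lands in bin 2.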

Convert each genre element into a dictionary and each year to an integer.


In [30]:
items['genres'] = items['genres'].apply(lambda x: {k:1 for k in x})
items['year'] = items['year'].astype(int)

In [31]:
items


Out[31]:
+---------+----------------------------------------------+-----------------------------+------+-------------+
| item_id | genres                                       | title                       | year | num_users   |
+---------+----------------------------------------------+-----------------------------+------+-------------+
| 1       | {"Children's": 1, 'Comedy': 1, 'Animati ... | Toy Story                   | 1995 | num_users_4 |
| 2       | {"Children's": 1, 'Adventure': 1, ...        | Jumanji                     | 1995 | num_users_3 |
| 3       | {'Romance': 1, 'Comedy': 1}                  | Grumpier Old Men            | 1995 | num_users_3 |
| 4       | {'Drama': 1, 'Comedy': 1}                    | Waiting to Exhale           | 1995 | num_users_2 |
| 5       | {'Comedy': 1}                                | Father of the Bride Part II | 1995 | num_users_2 |
| 6       | {'Action': 1, 'Thriller': 1, 'Crime': 1}     | Heat                        | 1995 | num_users_3 |
| 7       | {'Romance': 1, 'Comedy': 1}                  | Sabrina                     | 1995 | num_users_3 |
| 8       | {"Children's": 1, 'Adventure': 1}            | Tom and Huck                | 1995 | num_users_2 |
| 9       | {'Action': 1}                                | Sudden Death                | 1995 | num_users_2 |
| 10      | {'Action': 1, 'Adventure': 1, ...            | GoldenEye                   | 1995 | num_users_3 |
+---------+----------------------------------------------+-----------------------------+------+-------------+
[3529 rows x 5 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Train models

A collaborative filtering approach that scores items using the Jaccard similarity between the sets of users who have interacted with each pair of items
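For reference, the Jaccard similarity of two items is the overlap of their user sets divided by the union. A minimal sketch:

```python
def jaccard(users_a, users_b):
    # |A ∩ B| / |A ∪ B| over the sets of users who interacted with each item.
    a, b = set(users_a), set(users_b)
    if not (a or b):
        return 0.0
    return len(a & b) / float(len(a | b))
```

Two items rated by user groups {u1, u2, u3} and {u2, u3, u4} share 2 of 4 distinct users, giving a similarity of 0.5.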


In [32]:
m0 = gl.item_similarity_recommender.create(train)


Recsys training: model = item_similarity
Warning: Column 'score' ignored.
    To use this column as the target, set target = "score" and use a method that allows the use of a target.
Preparing data set.
    Data has 556371 observations with 6038 users and 3529 items.
    Data prepared in: 0.489734s
Computing item similarity statistics:
Computing most similar items for 3529 items:
+-----------------+-----------------+
| Number of items | Elapsed Time    |
+-----------------+-----------------+
| 1000            | 0.80228         |
| 2000            | 0.885286        |
| 3000            | 0.969132        |
+-----------------+-----------------+
Finished training in 1.17977s

Collaborative filtering approach that learns latent factors for each user and each item
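At prediction time, a factorization model scores a user-item pair by combining bias terms with the dot product of the learned latent vectors. A simplified sketch of the scoring rule (the trained model also includes a global intercept and ranking-specific terms):

```python
def score(user_vec, item_vec, user_bias=0.0, item_bias=0.0):
    # Predicted affinity = bias terms + dot product of the latent factors
    # (here, plain lists of floats standing in for learned vectors).
    dot = sum(u * i for u, i in zip(user_vec, item_vec))
    return user_bias + item_bias + dot
```

With `num_factors=32` as reported in the log above, each user and each item is represented by a 32-dimensional vector of this kind.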


In [33]:
m1 = gl.ranking_factorization_recommender.create(train, max_iterations=10)


Recsys training: model = ranking_factorization_recommender
Preparing data set.
    Data has 556371 observations with 6038 users and 3529 items.
    Data prepared in: 0.784596s
Training ranking_factorization_recommender for recommendations.
+--------------------------------+--------------------------------------------------+----------+
| Parameter                      | Description                                      | Value    |
+--------------------------------+--------------------------------------------------+----------+
| num_factors                    | Factor Dimension                                 | 32       |
| regularization                 | L2 Regularization on Factors                     | 1e-09    |
| solver                         | Solver used for training                         | adagrad  |
| linear_regularization          | L2 Regularization on Linear Coefficients         | 1e-09    |
| binary_target                  | Assume Binary Targets                            | True     |
| max_iterations                 | Maximum Number of Iterations                     | 10       |
+--------------------------------+--------------------------------------------------+----------+
  Optimizing model using SGD; tuning step size.
  Using 69546 / 556371 points for tuning the step size.
+---------+-------------------+------------------------------------------+
| Attempt | Initial Step Size | Estimated Objective Value                |
+---------+-------------------+------------------------------------------+
| 0       | 16.6667           | Not Viable                               |
| 1       | 4.16667           | Not Viable                               |
| 2       | 1.04167           | Not Viable                               |
| 3       | 0.260417          | Not Viable                               |
| 4       | 0.0651042         | No Decrease (1.47043 >= 1.38645)         |
| 5       | 0.016276          | 1.34543                                  |
| 6       | 0.00813802        | 1.35577                                  |
| 7       | 0.00406901        | 1.3659                                   |
| 8       | 0.00203451        | 1.37251                                  |
+---------+-------------------+------------------------------------------+
| Final   | 0.016276          | 1.34543                                  |
+---------+-------------------+------------------------------------------+
Starting Optimization.
+---------+--------------+-------------------+-----------------------------------+-------------+
| Iter.   | Elapsed Time | Approx. Objective | Approx. Training Predictive Error | Step Size   |
+---------+--------------+-------------------+-----------------------------------+-------------+
| Initial | 112us        | 1.38645           | 0.693158                          |             |
+---------+--------------+-------------------+-----------------------------------+-------------+
| 1       | 1.21s        | 1.33709           | 0.652715                          | 0.016276    |
| 2       | 2.58s        | 1.30773           | 0.643739                          | 0.016276    |
| 3       | 3.95s        | 1.29445           | 0.641196                          | 0.016276    |
| 4       | 5.29s        | 1.28572           | 0.639083                          | 0.016276    |
| 5       | 6.51s        | 1.2805            | 0.636927                          | 0.016276    |
| 6       | 7.69s        | 1.27567           | 0.635731                          | 0.016276    |
| 7       | 8.95s        | 1.27214           | 0.634294                          | 0.016276    |
| 8       | 10.13s       | 1.26873           | 0.633182                          | 0.016276    |
| 9       | 11.33s       | 1.26672           | 0.632232                          | 0.016276    |
| 10      | 12.94s       | 1.26386           | 0.631565                          | 0.016276    |
+---------+--------------+-------------------+-----------------------------------+-------------+
Optimization Complete: Maximum number of passes through the data reached.
Computing final objective value and training Predictive Error.
       Final objective value: 1.27025
       Final training Predictive Error: 0.62752

Collaborative filtering approach that learns latent factors for users, items, and side data
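With `side_data_factorization` enabled, side features such as a movie's year also get latent vectors, and the score sums the pairwise interactions of all active feature vectors. An illustrative sketch of that structure (the actual model additionally has linear and bias terms):

```python
def score_with_side(user_vec, item_vec, side_vecs):
    # Sum pairwise dot products over all active feature vectors:
    # user x item, user x each side feature, item x each side feature,
    # and side feature x side feature.
    vecs = [user_vec, item_vec] + list(side_vecs)
    total = 0.0
    for i in range(len(vecs)):
        for j in range(i + 1, len(vecs)):
            total += sum(a * b for a, b in zip(vecs[i], vecs[j]))
    return total
```

This lets items that share side features (e.g. the same release year) borrow statistical strength from one another.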


In [34]:
m2 = gl.ranking_factorization_recommender.create(train, 
                                                 item_data=items[['item_id', 'year']], 
                                                 max_iterations=10)


Recsys training: model = ranking_factorization_recommender
Preparing data set.
    Data has 556371 observations with 6038 users and 3529 items.
    Data prepared in: 0.757925s
Training ranking_factorization_recommender for recommendations.
+--------------------------------+--------------------------------------------------+----------+
| Parameter                      | Description                                      | Value    |
+--------------------------------+--------------------------------------------------+----------+
| num_factors                    | Factor Dimension                                 | 32       |
| regularization                 | L2 Regularization on Factors                     | 1e-09    |
| solver                         | Solver used for training                         | adagrad  |
| linear_regularization          | L2 Regularization on Linear Coefficients         | 1e-09    |
| binary_target                  | Assume Binary Targets                            | True     |
| side_data_factorization        | Assign Factors for Side Data                     | True     |
| max_iterations                 | Maximum Number of Iterations                     | 10       |
+--------------------------------+--------------------------------------------------+----------+
  Optimizing model using SGD; tuning step size.
  Using 69546 / 556371 points for tuning the step size.
+---------+-------------------+------------------------------------------+
| Attempt | Initial Step Size | Estimated Objective Value                |
+---------+-------------------+------------------------------------------+
| 0       | 12.5              | Not Viable                               |
| 1       | 3.125             | Not Viable                               |
| 2       | 0.78125           | Not Viable                               |
| 3       | 0.195312          | Not Viable                               |
| 4       | 0.0488281         | No Decrease (2.00723 >= 1.38643)         |
| 5       | 0.012207          | No Decrease (1.70097 >= 1.38643)         |
| 6       | 0.00305176        | No Decrease (1.4783 >= 1.38643)          |
| 7       | 0.000762939       | No Decrease (1.38799 >= 1.38643)         |
| 8       | 0.000190735       | 1.38582                                  |
| 9       | 9.53674e-05       | 1.38597                                  |
| 10      | 4.76837e-05       | 1.38613                                  |
| 11      | 2.38419e-05       | 1.38622                                  |
+---------+-------------------+------------------------------------------+
| Final   | 0.000190735       | 1.38582                                  |
+---------+-------------------+------------------------------------------+
Starting Optimization.
+---------+--------------+-------------------+-----------------------------------+-------------+
| Iter.   | Elapsed Time | Approx. Objective | Approx. Training Predictive Error | Step Size   |
+---------+--------------+-------------------+-----------------------------------+-------------+
| Initial | 85us         | 1.38643           | 0.693139                          |             |
+---------+--------------+-------------------+-----------------------------------+-------------+
| 1       | 1.54s        | 1.38538           | 0.691463                          | 0.000190735 |
| 2       | 3.09s        | 1.38529           | 0.689766                          | 0.000190735 |
| 3       | 4.64s        | 1.3855            | 0.688442                          | 0.000190735 |
| 4       | 6.17s        | 1.38603           | 0.687318                          | 0.000190735 |
| 5       | 7.68s        | 1.38688           | 0.686364                          | 0.000190735 |
| 6       | 9.21s        | 1.38799           | 0.685558                          | 0.000190735 |
| 7       | 10.74s       | 1.38946           | 0.684931                          | 0.000190735 |
| 8       | 12.60s       | 1.39114           | 0.684416                          | 0.000190735 |
| 9       | 14.37s       | 1.39332           | 0.684127                          | 0.000190735 |
| 10      | 16.60s       | 1.39561           | 0.683958                          | 0.000190735 |
+---------+--------------+-------------------+-----------------------------------+-------------+
Optimization Complete: Maximum number of passes through the data reached.
Computing final objective value and training Predictive Error.
       Final objective value: 1.39739
       Final training Predictive Error: 0.683917

In [35]:
m3 = gl.ranking_factorization_recommender.create(train, 
                                                 item_data=items[['item_id', 'year', 'genres']], 
                                                 max_iterations=10)


Recsys training: model = ranking_factorization_recommender
Preparing data set.
    Data has 556371 observations with 6038 users and 3529 items.
    Data prepared in: 0.619754s
Training ranking_factorization_recommender for recommendations.
+--------------------------------+--------------------------------------------------+----------+
| Parameter                      | Description                                      | Value    |
+--------------------------------+--------------------------------------------------+----------+
| num_factors                    | Factor Dimension                                 | 32       |
| regularization                 | L2 Regularization on Factors                     | 1e-09    |
| solver                         | Solver used for training                         | adagrad  |
| linear_regularization          | L2 Regularization on Linear Coefficients         | 1e-09    |
| binary_target                  | Assume Binary Targets                            | True     |
| side_data_factorization        | Assign Factors for Side Data                     | True     |
| max_iterations                 | Maximum Number of Iterations                     | 10       |
+--------------------------------+--------------------------------------------------+----------+
  Optimizing model using SGD; tuning step size.
  Using 69546 / 556371 points for tuning the step size.
+---------+-------------------+------------------------------------------+
| Attempt | Initial Step Size | Estimated Objective Value                |
+---------+-------------------+------------------------------------------+
| 0       | 10                | Not Viable                               |
| 1       | 2.5               | Not Viable                               |
| 2       | 0.625             | Not Viable                               |
| 3       | 0.15625           | Not Viable                               |
| 4       | 0.0390625         | No Decrease (1.70989 >= 1.38659)         |
| 5       | 0.00976562        | No Decrease (1.86695 >= 1.38659)         |
| 6       | 0.00244141        | No Decrease (1.42815 >= 1.38659)         |
| 7       | 0.000610352       | No Decrease (1.39472 >= 1.38659)         |
| 8       | 0.000152588       | 1.38591                                  |
| 9       | 7.62939e-05       | 1.38605                                  |
| 10      | 3.8147e-05        | 1.38615                                  |
| 11      | 1.90735e-05       | 1.38623                                  |
+---------+-------------------+------------------------------------------+
| Final   | 0.000152588       | 1.38591                                  |
+---------+-------------------+------------------------------------------+
Starting Optimization.
+---------+--------------+-------------------+-----------------------------------+-------------+
| Iter.   | Elapsed Time | Approx. Objective | Approx. Training Predictive Error | Step Size   |
+---------+--------------+-------------------+-----------------------------------+-------------+
| Initial | 109us        | 1.38659           | 0.693033                          |             |
+---------+--------------+-------------------+-----------------------------------+-------------+
| 1       | 2.03s        | 1.38588           | 0.688326                          | 0.000152588 |
| 2       | 4.03s        | 1.38594           | 0.686816                          | 0.000152588 |
| 3       | 6.01s        | 1.38709           | 0.685309                          | 0.000152588 |
| 4       | 7.99s        | 1.38863           | 0.684032                          | 0.000152588 |
| 5       | 9.94s        | 1.39058           | 0.682958                          | 0.000152588 |
| 6       | 11.92s       | 1.39261           | 0.682088                          | 0.000152588 |
| 7       | 13.90s       | 1.3949            | 0.681394                          | 0.000152588 |
| 8       | 16.67s       | 1.39736           | 0.680825                          | 0.000152588 |
| 9       | 19.45s       | 1.40008           | 0.680407                          | 0.000152588 |
| 10      | 22.26s       | 1.40275           | 0.680151                          | 0.000152588 |
+---------+--------------+-------------------+-----------------------------------+-------------+
Optimization Complete: Maximum number of passes through the data reached.
Computing final objective value and training Predictive Error.
       Final objective value: 1.40473
       Final training Predictive Error: 0.680026

Evaluation

Create a precision/recall plot comparing the recommendation quality of the above models on our held-out data.
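As a reminder of what the tables below report: precision@k is the fraction of the top-k recommendations the user actually interacted with in the held-out data, and recall@k is the fraction of the user's held-out items recovered in the top k. A minimal sketch for a single user:

```python
def precision_recall_at_k(recommended, relevant, k):
    # precision@k: fraction of the top-k recommendations that are relevant.
    # recall@k: fraction of all relevant items recovered in the top k.
    hits = len(set(recommended[:k]) & set(relevant))
    return hits / float(k), hits / float(len(relevant))
```

The `mean_precision` and `mean_recall` columns below average these per-user values over the sampled validation users at each cutoff.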


In [40]:
model_comparison = gl.compare(valid, [m0, m1, m2, m3], user_sample=.3)


compare_models: using 297 users to estimate model performance
PROGRESS: Evaluate model M0

Precision and recall summary statistics by cutoff
+--------+----------------+-----------------+
| cutoff | mean_precision |   mean_recall   |
+--------+----------------+-----------------+
|   1    | 0.340067340067 | 0.0273701812558 |
|   2    | 0.308080808081 | 0.0478083726971 |
|   3    | 0.288439955107 | 0.0644063022978 |
|   4    | 0.273569023569 | 0.0837581789951 |
|   5    | 0.259259259259 |  0.097804796748 |
|   6    | 0.246913580247 |  0.110896121437 |
|   7    | 0.239057239057 |  0.120171306579 |
|   8    | 0.231902356902 |  0.133021390364 |
|   9    | 0.21922933034  |  0.140607202562 |
|   10   | 0.211111111111 |  0.150910548487 |
+--------+----------------+-----------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M1

Precision and recall summary statistics by cutoff
+--------+----------------+-----------------+
| cutoff | mean_precision |   mean_recall   |
+--------+----------------+-----------------+
|   1    | 0.208754208754 | 0.0196167645986 |
|   2    | 0.185185185185 | 0.0325496617873 |
|   3    | 0.179573512907 | 0.0423502309465 |
|   4    | 0.172558922559 | 0.0516731283008 |
|   5    | 0.165656565657 | 0.0626777457678 |
|   6    | 0.156565656566 | 0.0708693455856 |
|   7    | 0.151996151996 | 0.0777337348093 |
|   8    | 0.144781144781 | 0.0849421423653 |
|   9    | 0.140665918444 | 0.0912976245018 |
|   10   | 0.135353535354 | 0.0963525147845 |
+--------+----------------+-----------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M2

Precision and recall summary statistics by cutoff
+--------+-----------------+------------------+
| cutoff |  mean_precision |   mean_recall    |
+--------+-----------------+------------------+
|   1    |  0.10101010101  | 0.00630900298195 |
|   2    | 0.0942760942761 | 0.0107194435737  |
|   3    |  0.107744107744 | 0.0214553505285  |
|   4    |  0.106902356902 | 0.0282813326662  |
|   5    |  0.104377104377 | 0.0371780264877  |
|   6    |  0.104938271605 | 0.0455064168293  |
|   7    |  0.101491101491 | 0.0499579851595  |
|   8    |  0.101430976431 | 0.0556392248383  |
|   9    |  0.104377104377 | 0.0633802772197  |
|   10   |  0.103703703704 | 0.0683569645247  |
+--------+-----------------+------------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M3

Precision and recall summary statistics by cutoff
+--------+-----------------+-----------------+
| cutoff |  mean_precision |   mean_recall   |
+--------+-----------------+-----------------+
|   1    |  0.144781144781 | 0.0120901537519 |
|   2    |  0.116161616162 | 0.0187377280862 |
|   3    | 0.0976430976431 | 0.0222244648549 |
|   4    | 0.0993265993266 | 0.0295933844723 |
|   5    | 0.0976430976431 | 0.0386882538229 |
|   6    | 0.0925925925926 | 0.0426572190333 |
|   7    | 0.0899470899471 | 0.0483395580037 |
|   8    |  0.087962962963 | 0.0517209152979 |
|   9    | 0.0845491956603 | 0.0557697749298 |
|   10   | 0.0814814814815 | 0.0601898874811 |
+--------+-----------------+-----------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M4

Precision and recall summary statistics by cutoff
+--------+-----------------+------------------+
| cutoff |  mean_precision |   mean_recall    |
+--------+-----------------+------------------+
|   1    |  0.013468013468 | 0.00107730896088 |
|   2    | 0.0117845117845 | 0.00128640444201 |
|   3    |  0.013468013468 | 0.00318878244088 |
|   4    |  0.013468013468 | 0.0051776828255  |
|   5    | 0.0127946127946 | 0.0055468443961  |
|   6    |  0.013468013468 | 0.00650871005558 |
|   7    |  0.013468013468 | 0.00751352734827 |
|   8    | 0.0130471380471 | 0.00891898234022 |
|   9    |  0.013468013468 | 0.00981078819657 |
|   10   |  0.013468013468 | 0.0112822641671  |
+--------+-----------------+------------------+
[10 rows x 3 columns]

Model compare metric: precision_recall

In [24]:
gl.show_comparison(model_comparison, [m0, m1, m2, m3])



In [ ]: